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Abstract 

We present a new method for the detection of gene pathways associated with a multi- 
variate quantitative trait, and use it to identify causal pathways associated with an imaging 
endophenotype characteristic of longitudinal structural change in the brains of patients with 
Alzheimer's disease (AD). Our method, known as pathways sparse reduced-rank regression 
(PsRRR), uses group lasso penalised regression to jointly model the effects of genome- wide 
single nucleotide polymorphisms (SNPs), grouped into functional pathways using prior knowl- 
edge of gene- gene interactions. Pathways are ranked in order of importance using a resam- 
pling strategy that exploits finite sample variability. Our application study uses whole genome 
scans and MR images from 464 subjects in the Alzheimer's Disease Neuroimaging Initiative 
(ADNI) database. 66,182 SNPs are mapped to 185 gene pathways from the KEGG pathways 
database. Voxel- wise imaging signatures characteristic of AD are obtained by analysing 3D 
patterns of structural change at 6, 12 and 24 months relative to baseline. High-ranking, AD 
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endophenotype-associated pathways in our study include those describing chemokine, Jak- 
stat and insuhn signalUng pathways, and tight junction interactions. All of these have been 
previously implicated in AD biology. In a secondary analysis, we investigate SNPs and genes 
that may be driving pathway selection, and identify a number of previously validated AD 
genes including CRl, APOE and TOMM40. 

Keywords: Alzheimer's Disease, imaging genetics, atrophy, gene pathways, sparse regression 



1 Introduction 



A growing list of genetic variants have now been associated with greater susceptibility to develop 
early and late-onset Alzheimer's disease (AD), with the APOEeA allele consistently identified as 
having the greatest effect (for an up to date list see www . alz gene . org). Recently, case-control sus- 
ceptibility studies have been augmented by studies using neuroimaging phenotypes. The rationale 
here is that the use of heritable imaging signatures ('endophenotypes') of disease may increase 
the power to detect causal variants, since gene effects are expected to be more penetrant at this 
level ( | Meyer- Lindenb erg and Weinberger 2QQ6| ). This 'imaging-genetic' approach has been used to 
identify genes associated with a range of imaging phenotypes including measures of hippocampal 
volume (Stein et al, 2012) and cortical thickness (Burggren et a/., 2QQ8[ ). 

AD is a moderate to highly heritable condition, yet as with many common heritable diseases, 
association studies have to date identified gene variants explaining only a relatively modest amount 



of known AD heritability (Braskie et al. , 2011). One approach to uncovering this 'missing heri- 



tability' is motivated by the observation that in many cases disease states are likely to be driven 
by multiple genetic variants of small to moderate effect, mediated through their interaction in 
molecular networks or pathways, rather than by the effects of a few, highly penetrant mutations 



(Schadt, 2009). Where this assumption holds, the hope is that by considering the joint effects 



of multiple variants acting in concert, pathways genome- wide association studies (PGWAS) will 
reveal aspects of a disease's genetic architecture that would otherwise be missed when considering 



variants individually (Wang et al. 2010 Fridley and Biernacka 2011). Another potential benefit 



of the PGWAS approach is that it can help to elucidate the mechanisms of disease by providing a 



biological interpretation of association results (Cantor et al. 2010). In the case of AD for example 



an understanding of the underlying mechanisms by which gene mutations impact disease etiology 
may play an important role in the translation of basic AD biology into therapy and patient care 



dSleegers et a/.[|2010D . 

In this paper, we present the first PGWAS method that is able to accommodate a multivari- 
ate quantitative phenotype, and apply our method to a pathways analysis of the ADNI cohort, 
comparing genome-wide single nucleotide polymorphism (SNP) data with voxel-wise tensor-based 
morphometry (TBM) maps describing longitudinal structural changes that are characteristic of 
AD. In this study we map SNPs to pathways from the KEGG pathways database, a curated 
collection of functional gene pathways representing current knowledge of molecular interaction 
and reaction networks ( http : //www . genome . j p/kegg/ pathway . html ) . Our method is however 
able to accommodate alternative sources of information for the grouping of SNPs and genes, for 



2 



example using gene ontology (GO) terms, or information from protein interaction networks (Wu 
et a/.[|2QlQ[ [Jensen and Bork[ |2Q1Q| ). 



Many existing PGWAS methods, such as GenGen (Wang et al 2009) and ALLIGATOR (Hoi- 



mans et al. 



2009 ) rely on univariate statistics of association, whereby each SNP in the study is first 



independently tested for association with a univariate quantitative or dichotomous (case-control) 
phenotype. SNPs are assigned to pathways by mapping them to adjacent genes within a specified 
distance, and individual SNP or gene statistics are then combined across each pathway to give a 
measure of pathway significance, corrected for multiple testing. Methods must also account for 
the potentially biasing effects of gene and pathway size and linkage disequilibrium (LD), and this 
is generally done through permutation. A potential disadvantage of these methods is that each 
SNP is considered separately at the first step, with no account taken of SNP-SNP dependencies. 
In contrast, a multilocus or multivariate model that considers all SNPs simultaneously may char- 
acterise SNP effects more accurately by aiding the identification of weak signals while diminishing 
the importance of false ones ( [Hoggart et a/.[|2008| ). 

In earlier work we developed a multivariate PGWAS method for identifying pathways asso- 



ciated with a single quantitative trait (Silver and Montana, 2012). We used a sparse regression 



model - the group lasso - with SNPs grouped into pathways. We demonstrated in simulation stud- 
ies using real SNP and pathway data, that our method showed high sensitivity and specificity for 
the detection of important pathways, when compared with an alternative pathways method based 
on univariate SNP statistics. Our method showed the greatest relative gains in performance where 
marginal SNP effect sizes are small. Here we extend our previous model to accommodate the case 
of a multivariate neuroimaging phenotype. We do this by incorporating a group sparsity constraint 
on genotype coefficients in a multivariate sparse reduced-rank regression model, previously devel- 
oped for the identification of single causal variants ( [Vounou et al. , 2010). Our proposed 'Pathways 
Sparse Reduced-Rank Regression' (PsRRR) algorithm incorporates phenotypes and genotypes in 
a single model, and accounts for potential biasing factors such as dependencies between voxels and 
SNPs using an adaptive, weight-tuning procedure. 



The article is presented as follows. We begin in section 2.1 with a description of the voxel 



wise TBM maps used in the study, and in section 2.2 we outline how we use these maps to 
generate an imaging signature characteristic of structural change in AD, that is able to discriminate 



between AD patients and controls. In section 2.3 we describe the genotype data used in the study. 



together with quality control procedures, and in section [2^ we explain how this genotype data is 
mapped to gene pathways. The theoretical underpinnings of the PsRRR method are described in 
section [23] We explain our method for ranking AD-associated pathways, SNPs and genes using 



a resampling procedure in section |2.6[ and discuss our strategies for addressing the significant 
computational challenge of fitting a regression-based model with such high dimensional datasets 
in section 2/7 Pathway, SNP and gene ranking results are presented in section |3| and we conclude 
with a discussion in section [U 



2 Materials and methods 



Imaging and genotype data used in this study were obtained from the Alzheimer's Disease Neu- 
roimaging Initiative (ADNI) database (adni.loni.ucla.edu). The ADNI was launched in 2003 by 
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the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengi- 
neering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies 
and non-profit organizations, as a 5-year public-private partnership. The primary goal of ADNI 
has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography 
(PET), other biological markers, and clinical and neuropsychological assessment can be combined 
to measure the progression of mild cognitive impairment (MCI) and early AD. Determination 
of sensitive and specific markers of very early AD progression is intended to aid researchers and 
clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and 
cost of clinical trials. 



2.1 Imaging data 

Longitudinal brain MRI scans (1.5 Tesla) were downloaded from the ADNI public database (http : ' 
[//www. loni .ucla. edu/ADNI/Data/). Serial brain MRI scans (N=3512; see Table[l]) were analyzed 
from 200 probable AD patients, 410 individuals with amnestic MCI, and 232 healthy elderly 
controls (CN). Subjects were scanned at screening and followed up at 6, 12, 18, 24, 36, and 48 
months. All subjects were scanned with a standardized 1.5T MP-RAGE protocol developed for 
ADNI ( Jack et al , 2008 ) . The typical acquisition parameters were repetition time (TR) of 2400 
ms, minimum full echo time (TE), inversion time (TI) of 1000 ms, fiip angle of 8, 24 cm field of 
view, 192 X 192 x 166 acquisition matrix in the x— , and z— dimensions, yielding a voxel size 
of 1.25 X 1.25 X 1.2 mm^, later reconstructed to 1 mm isotropic voxels. Image correction steps 
included gradwarp (Jovicich et al., 2006), Bl-correction (Jack et al 2008), N3 bias field correction 



(Sled et al., 1998), and phantom-based geometrical scaling (Gunter et al. 2006). 



Linear registration (9-parameter) was used to align the longitudinal scan series of each subject 
and then the mutually aligned time-series was registered to the International Consortium for Brain 
Mapping template (ICBM-53) ( [Mazziotta et a/. , 2001). Brain masks that excluded skull, other 
non-brain tissues, and the image background were generated automatically using a parameter-less 
robust brain extraction tool (ROBEX) (Iglesias et al 2011). 

Individual Jacobian maps were created to estimate 3D patterns of structural brain change 
over time by warping the skull-stripped, globally registered and scaled follow-up scan to match 
the corresponding screening scan. We used a non-linear, inverse consistent, elastic intensity- 
based registration algorithm (Leow et al 2005), which optimizes a joint cost function based on 
mutual information (MI) and the elastic energy of the deformation. Color-coded maps of the 
Jacobian determinants were created to illustrate regions of ventricular/ CSF expansion (i.e., with 
det J(r) > 1), or brain tissue loss (i.e., with det J(r) < 1) ( [Ashburner and Friston , 2003[ [Chung 



et a/.[ |2001[ [Freeborough and Fox[ [l 998 [ [Riddle et a/. [ |2004] [Thompson et a/.[ |2000"1 |Toga[ |1999 ) 



over time. These longitudinal maps of tissue change were also spatially normalized across subjects 
by nonlinearly aligning all individual Jacobian maps to an average group template known as the 
minimal deformation target (MDT), for regional comparisons and group statistical analyses. 

The study was conducted according to the Good Clinical Practice guidelines, the Declaration 
of Helsinki and U.S. 21 CFR Part 50-Protection of Human Subjects, and Part 56-Institutional 
Review Boards. Written informed consent was obtained from all participants before experimental 
procedures, including cognitive tests, were performed. 
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Table 1: Available scans for the ADNI-1 dataset (downloaded on February 28, 2011) 





Screening 


6Mo 12Mo 


18Mo 


24Mo 36Mo 


48Mo 


AD 


200 


165 144 


n/a 


111 n/a 


n/a 


MCI 


410 


358 338 


296 


253 176 


41 


CN 


232 


214 202 


n/a 


178 149 


45 


Total 


842 


737 684 


296 


542 325 


86 






At screening: 








Group 


age (years) 


N male 


N female 






AD 


75.7±7.7 


103 


97 






MCI 


74.8±7.5 


264 


146 






CN 


76.0±5.0 


120 


112 





2.2 Voxel filtering 

To maximise the power to detect causal pathways, we seek a phenotype which is highly represen- 
tative of those structural changes in the brain that are characteristic of AD. One approach is to 
use prior knowledge on regions of interest (ROI) to extract a univariate quantitative measure as 



a disease signature (Potkin et al. 2009). We instead use a voxel-wise, data-driven approach to 
produce a multivariate disease signature. 

We begin by selecting 464 individuals (99 AD, 211 MCI, 154 CN) with longitudinal maps at 
time points 6, 12 and 24 months, who have been genotyped by ADNI. Other time points are 
excluded because of missing observations. For each voxel and for each one of the three groups, we 
then fit a linear regression with an intercept term, where the dependent variable is the voxel value 
(change relative to baseline at screening), and the independent variable is time. The regression 
coefficient for the slope thus gives a summary measure of tissue change over time at each voxel. We 
next perform an analysis of variance (ANOVA) on each slope coefficient to identify which voxels are 
discriminative between AD and CN, with sex and age as covariates (MCI subjects are not used at 
this stage as we wish to obtain a clear imaging-derived signature for AD). We then select the most 
discriminative voxels whose ANOVA p- values exceed a level of 0.05, with a Bonferroni correction 
for multiple testing. The final set of phenotypes used in the study correspond to the voxel-wise 
slope coefficients for all 464 subjects (AD, MCI and CN) at the selected voxels, corrected for sex 
and age. 

2.3 Genotype data 

Genotypes for the 464 subjects in the study were obtained from the ADNI database. ADNI 
genotyping is performed using the Human610-Quad Bead-Chip, which includes 620,901 SNPs and 



copy number variations (see Saykin et al. (2010 ) for details). SNPs defining the AP0Ee4 variant are 
not included in the original genotyping chip, but have been genotyped separately by ADNI. These 
were added to the final genotype dataset. Subjects were unrelated, and all of European ancestry, 
and passed screening for evidence of population stratification using the procedure described in 



(Stein et al., 2010). We included only autosomal SNPs in the study, and additionally excluded 



SNPs with a genotyping rate < 95%, a Hardy- Weinberg equilibrium p- value < 5 x 10 ^, and a 
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minor allele frequency < 0.1. Finally, since our method does not allow for missing SNP minor 
allele counts, missing genotypes were imputed (see Vounou et al (2011) for details). 434,271 SNPs 
remained after all SNP filtering steps described above. 



2.4 SNP to pathway mapping 



Our SNP mapping procedure rests on the extraction of prior information from a pathways database 
that provides curated lists of genes, mapped to functional networks or pathways. Pathways 
databases such as those provided by KEGG (|http : //www . genome . j p/kegg/pathway . html ) , Re- 
actome (http://www.reactome.org/) and Biocarta ( http: //www.biocarta. com/[ ) typically clas- 
sify pathways across a number of functional domains, for example apoptosis, cell adhesion or lipid 
metabolism; or crystallise current knowledge on specific disease-related molecular reaction net- 
works. 

Starting with a list of all genes that map to at least one pathway in the database, we assign SNPs 
to genes within a specified distance, upstream or downstream of the gene in question, and thence 
to pathways. This process is illustrated schematically in Fig. [l] For our AD pathways study, we 



pathways 



genes 



genotyped SNPs 





□ □ 



31 




□ □ 



Figure 1: Schematic illustration of the SNP to pathway mapping process, (i) Known genes (green circles) 
are mapped to pathways using information on gene-gene interactions (top row), obtained from a gene 
pathways database. Many genes do not map to any known pathway (unfilled circles). Also, some genes 
may map to more than one pathway, (ii) Genes that map to a pathway are in turn mapped to genotyped 
SNPs within a specified distance. Many SNPs cannot be mapped to a pathway since they do not map 
to a mapped gene (unfilled squares). Note SNPs may map to more than one gene. Some SNPs (orange 
squares) may map to more than one pathway, either because they map to multiple genes belonging to 
different pathways, or because they map to a single gene that belongs to multiple pathways. 



proceed as follows. A list of 21,004 human gene chromosomal locations, corresponding to human 
genome assembly GRCH36 was obtained using Ensembl's BIOMART API (Iwww.biomart . org). 
SNPs were then mapped to any gene within 10k base pairs, upstream or downstream of the 
gene in question. This resulted in 211,106 SNPs being mapped to 18,405 genes. While the 
majority of known genes did map to at least one SNP in our study, approximately half of the 
SNPs passing QC were not located within lOkbp of a known gene. For pathway mapping, we used 
the KEGG canonical pathway gene sets obtained from from the Molecular Signatures Database 
v3.0 (http://www.broadinstitute.org/gsea/msigdb/index.jsp), which contains 186 gene sets, 
which map to a total of 5,267 distinct genes, with many genes mapping to more than one pathway. 
Note that only around 25% of all known genes map to a pathway in this dataset. We map all SNPs 
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within lOkbp of one or more of the 5,267 pathway-mapped genes to the pathway(s) concerned. 
Finahy, we exclude the largest pathway, by number of mapped SNPs, ('Pathways in Cancer') that 
is highly redundant, in that it contains multiple other pathways as subsets. This results in 66,162 
SNPs mapped to 4,425 genes and 185 pathways (see Fig. |2|. 



ADNI QC'd SNPs 

434,271 SNPs 



SNP to gene nnapping 



Genes: GRCH36/hgl8 
21,004 genes 




211,106 SNPs mapped to 
18,405 genes within 10l<bp 



Pathways: KEGG 
186 Pathways containing 
5,267 distinct genes 



SNP to pathway mapping 



Exclude largest, highly redundant, pathway 



66,162 SNPs mapped to 4,632 genes and 185 pathways 



overlap expansion 



P* = 175,544 SNPs mapped to 185 pathways 



Figure 2: Mapping SNPs to pathways 



The distribution of pathway sizes in terms of the number of SNPs that they map to is illustrated 
in Fig. [3] (left). Pathway sizes range from 57 to 5,111 SNPs (mean 949). The distribution of 
overlapping SNPs, that is the number of pathways to which each SNP is mapped, is illustrated in 
Fig. [3] (right). This ranges from 1 to 45 pathways (mean 2.65). 

Note that following the above procedure, some genes previously implicated in AD studies do 
not map to any pathways, and thus are not included in the analysis. For example, in this study, 12 
out of 30 genes highlighted in the review by Braskie et al (2011) are mapped to pathways. Also 
note that since SNPs are mapped to all genes within a range of lOkbp, AD implicated SNPs may 
map to more than one gene, and its corresponding pathway (s). This is the case for example with 
a number of SNPs mapping to the APOE and TOMM40 genes. This information is summarised 
in Table [2 



2.5 Pathways sparse reduced-rank regeression 

We consider the problem of identifying gene pathways associated with a multivariate quantitative 
trait (MQT) or phenotype, y G M^. The observed values for phenotype g, measured for 
unrelated individuals, are arranged in an {N x 1) response vector y^, and the Q phenotypes are 
arranged in an {N x Q) response matrix Y = (yi, . . . ,yQ). We assume minor allele counts for 
P SNPs are recorded for all individuals, and denote by Xij the minor allele count for SNP j on 
individual i. These are arranged in an {N x P) genotype design matrix X. We additionally assume 
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Figure 3: Left: Pathway sizes. Distribution of KEGG pathways, by the number of ADNI SNPs that they 
map to. Right: SNP overlaps. Distribution of ADNI SNPs, by the number of pathways that they map 
to. SNPs map to multiple pathways either because they map to a gene that belongs to more than one 
pathway, or because they map to more than one gene belonging to more than one pathway. 



Table 2: AD genes included in this study. 12 out of 30 genes previously implicated with AD fBraskie] 
et al. 2011|) that are included in this study are listed in the left hand column. These are genes that (a) 



map to a KEGG pathway and (b) have a genotyped SNP within lOkbp. The right hand column shows 
neighbouring genes that map to one or more SNPs mapping to the respective AD implicated gene. 



Implicated gene Mapped genes in study 



TO MM 40 


TOMM40 APOE PVRL2 


ACE 


ACE 


EPHA4 


EPHA4 


CCR2 


CCR2 CCR5 


APOE 


TOMM40 APOE PVRL2 


FAS 


FAS 


CHRNB2 


ADAR CHRNB2 


EFNA5 


EFNA5 


LDLR 


LDLR 


CRl 


CRl CR2 


GRIN2B 


GRIN2B 


IL8 


IL8 
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all phenotypes and genotypes are mean centred, and that SNP genotypes are standardised to unit 
variance, so that xf^- = 1, for j = 1, . . . , P. 

If we denote by C = (Ci, . . . , Cq), a (P x Q) matrix of regression coefficients, then we can 
model the multivariate response as 

(1) Y = XC + E 

where E is an {N x Q) matrix of error terms. A least squares estimate for C may be obtained by 
generalising the multiple least squares optimisation to include a multivariate response, that is by 
minimising the residual sum of squares 

(2) y^MMLR ^ ^^i^Y - XC)(Y - XC)'}. 

Where N > P and the design matrix X is of full rank, the least squares estimates are given by 
C = (X'X"^)X'Y. Note that the (P x 1) column vectors Ci,...,Cq of C are just the least 
squares estimates of the regression of each on X, that is 

(3) Cq = aTgmm\\yq-XCq\\l q = l,...,Q 

where || • ||2 denotes the £2 (Euclidean) norm. 

For high-dimensional datasets, such as those typically found in genomics, this model is unsuit- 
able for a number of reasons. Firstly, P ^ A/", so that X'X is singular and thus not invertible and 
the estimates are not uniquely defined. Even where P < N ^ for example in a candidate gene 
study, LD or equivalently near multi-collinearity between predictors means that X'X is nearly 
singular, resulting in inflated variance in SNP coefficient estimates. Furthermore, the estimation 
(|3| is equivalent to performing Q independent regressions, and takes no account of the multivariate 
nature of Y. Ideally, we would like to exploit this in our estimation procedure to boost power 



(Breiman and Friedman 1997; Vounou et al, 2010). 



These limitations are addressed in reduced-rank regression (RRR), (Reinsel and Velu, 1998 



Hastie et al. 2008), by restricting the rank of the coefficient matrix C. Specifically we impose the 
constraint that C has rank r < min(P, Q), and rewrite C as C = BA, where A and B both have 
(full) rank r. The reduced rank form of ([T]) is then given by 

(4) Y = XBA + E 

where B and A are (P x r) and (r x Q) matrices of regression coefficients respectively relating to 
genotypes and phenotypes. This model has the interesting interpretation of exposing r hidden or 
latent factors^ which capture the major part of the relationship between Y and X. If we denote 
by B(/g), the kih. column of B, then we see that the products XB(/.),/c = l,...,r, represent r 
linear combinations of the P predictor variables. Similarly, the r row vectors, A^/^), /c = 1, . . . , r, 
represent the transformation of each of these back to the dimensions of Y, so that they can predict 
the response. The linear combinations XB(/g) and YA^^.^ thus represent a reduced set of r (latent) 
factors that capture the relationship between response and predictors, reduced in the sense that 
this set has dimensionality r < min(P, Q). 
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As a first approximation, we consider tiie rank-1 RRR model wiiicii captures tiie first set of 
genotype and piienotype latent factors describing the association between X and Y. With r = 1, 
we rewrite Q as 

(5) Y = Xba + E 

where b and a are (P x 1) and (1 x Q) coefficient vectors respectively relating to genotypes and 
phenotypes. Least squares estimates for b and a are then obtained by minimising the rank-1 
equivalent of 

(6) M^^^^ = Tr{(Y - Xba)r(Y - Xba)'} 

where F is a given {q x q) positive definite matrix of weights. The choice of T reflects how we 
deal with correlation between the responses yi,...,yg in the least squares optimisation. Such 
correlations can be exploited by setting T to be the inverse of the estimated covariance of the 
responses. In the context of imaging genetics for example, where a voxel- wise multivariate response 
may be derived from structural MRI, spatial correlations between phenotypes are expected in 
part to reflect common genetic variation. However, the calculation of the inverse (Y'Y)~^ is 



computationally very intensive, so in common with Vounou et al (2010), we instead use the 
simplifying approximation F = Ig, effectively assuming the responses to be uncorrelated. 

We now turn to the case where all P SNPs may be mapped to L groups, Qi C {1, . . . ,P}, 



/ = 1,...,L, for example by mapping SNPs to gene pathways (see section 2.4). We begin by 
assuming that pathways are disjoint or non-overlapping, that is Qi f) Qi' =0 for any I ^ V . We 
denote the rank-1 vector of SNP regression coefficients by b = (6i,...,6p). We additionally 
denote the matrix containing all SNPs mapped to pathway Qi by X^ = {^h •> • • • ^ ^Si)^ where 
Xj = (xij, X2j, . . . , XNjY , is the column vector of observed SNP minor allele counts for SNP j, and 
Si is the number of SNPs in Qi. Finally, we denote the corresponding vector of SNP coefficients 
by b^ = {hi^,hi^,...,hsi). 

In general, where P is large, we expect only a small proportion of SNPs to be 'causal', in 
the sense that they exhibit phenotypic effects. We further assume that causal SNPs will tend 
to be enriched within functional groups, or gene pathways. This latter assumption is illustrated 
schematically in Fig. |4j where causal SNPs (marked in grey) tend to accumulate within a small 
number of causal pathways, while the majority of pathways contain no causal SNPs. A model that 
generates such a sparsity pattern is said to be group-sparse^ in that SNPs affecting Y are to be 
found in a set C C {1, . . . , L} of causal gene pathways (groups), with \C\ <C P, where \C\ denotes 
the cardinality of C. We seek a parsimonious model that is able to identify this set, C, of causal 
pathways, by imposing a group- sparsity constraint on the estimated SNP coefficient vector, b. 

In sparse reduced-rank regression (sRRR) ([Vounou et al., 2010, 2011), sparse estimates for 



genotype and/or phenotype coefficient vectors are obtained by imposing a regularisation penalty 
on b and/or a respectively. Apart from the benefits of model parsimony, enforcing a sparsity 
constraint on b also allows us to deal with the P ^ N case, and with multicollinearity between 
predictors. In our proposed 'Pathways Sparse Reduced-Rank Regression' (PsRRR) model, the 



required group sparsity pattern is obtained by imposing an additional group lasso penalty (Yuan 



and Lin 2006) on ([6|. Group-sparse solutions to the rank-1 RRR model ^ are then obtained by 
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Figure 4: Group-sparse distribution of causal SNPs. The set S C {1, . . . , P} of causal SNPs influencing 
the phenotype are represented by boxes that are shaded grey. Causal SNPs are assumed to occur within 
a set C of causal pathways. Here C = {2,3}. Note that the particular distribution of causal SNPs may 
vary for each individual, i = 1, . . . , N . The group sparsity assumption is that \C\ <^ L. 



minimising the following penalised least squares problem 

1 ^ 

(7) y^PsRR.R ^ _Tr{(Y - Xba)(Y - Xba)'} + A^^,||b,||2 

with respect to b and a. ^ corresponds to an ordinary least squares (OLS) optimisation, but 
with an additional group- wise penalty whose size depends on ||b^||2, / = 1, . . . , L, a regularisation 
parameter A, and an additional group weighting parameter ui that can vary from group to group. 
Depending on the value of A, this penalty has the effect of setting multiple pathway SNP coeffi- 
cient vectors, hi = 0,/ C {!,..., L}, thereby enforcing group sparsity. Pathways with non-zero 
coefficient vectors form the set C of selected pathways, so that 

C(A) = {l:hi^ 0}. 

Expanding ([7|), and noting that the first term YY' does not depend on b or a, solutions satisfy 

1 ^ 

(8) b, a = arg min { - (-2aY'Xb + aa'b'X'Xb) + A ^ | |bz 1 12 } . 

b,a ^ 

For fixed a, this penalised least squares problem equates to a convex optimisation in b, and is 
thus amenable to solution using coordinate descent ( Friedman et a/.[|2QQ7 ). A global solution can 



then be obtained by iteratively estimating one coefficient vector (b or a), while holding the other 
fixed at its current value, until convergence (Chen and Chan, 2Q12| ). 



Thus, for fixed b and A, and with the additional constraint that bb' = 1, we estimate a as 

1 ^ 
a = arg min { - (-2aY'Xb + aa'b'X'Xb) + A ^ | |b| 1 12 } . 



^=1 



11 



Differentiating and setting to zero gives 



b'X^Xb 

Similarly, for fixed a, and with the additional constraint that aa' = 1, we have 

L 

^ ;-2a VXb + b'X'Xb^ + A " 



1 ^ 
(9) b = argmin{-(-2aY'Xb + b'X'Xb) + A^^^llb^lls}. 



This is equivalent to a standard group lasso estimation problem with univariate response vector 
Ya^ In earlier work we describe a method, 'Pathways Group Lasso with Adaptive Weights' (P- 
GLAW), for solving this problem, specifically tailored to the situation where predictor variables 



are SNPs grouped into pathways (Silver and Montana 2012). Here, we briefly recap key points 



of this method, and incorporate a number of extensions designed to accommodate a MQT in the 
context of PsRRR with coordinate descent. 

The minimising function ^ is convex, and can be solved using block coordinate descent (BCD) 



(Friedman et al, 2010), an extension of coordinate descent to convex estimation with grouped 
variables. BCD rests on obtaining successive estimates, b^, for each pathway in turn, while keeping 
current estimates for all other pathways, b/e. A: 7^ constant, until a global minimum is obtained. 
For pathway Qi^l = 1, . . . , L, estimates for each SNP coefficient, bj^j = /i, . . . , Isi are obtained 
through coordinate descent within the group. The group lasso estimation algorithm using BCD is 
presented in Box[l] 

Box 1 l](a, Y,X,A): GL estimation algorithm using BCD 

1. b^O 

block coordinate descent 

2. repeat 

3. for / = 1,2,...,L 

5. if ||X;r^||2 < A^^ 

6. bi^O 

7. else 

coordinate descent within block 

8. repeat 

9. for j = li,...,lsi 
10. r ^ Ya - X6 



11. ^ 



l+T 



"I|bill2 

12. until b^ converges 

13. until b converges 



As A increases, fewer groups (or pathways) are selected by the model (Box[T] step 5), while 
for selected pathways with b^ 7^ 0, estimated SNP coefficients, bj^j = /i, . . . ,5^, tend to shrink 
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towards zero (Box[l] step 11). 

The full PsRRR estimation algorithm is presented in Box [2] 



Box 2 Rank-1 PsRRR estimation algorithm using coordinate descent 



a^l/||l||2 
repeat: 

A ^ ^\max, where Xmax minA{A : ||XfYa ||2 A^^, / = 1, . . . ,L} 

b ^ l](a, Y, X, A) (from Box[l]) 

b ^ b/ 1 1 b 1 1 2 (normalise) 
a^ b'X Y 



b'X'Xb 

a ^ a/||a||2 (normalise) 
until b and a converge 



Note that we set the regularisation parameter, A, to be a constant fraction (7) of the maximal 
value, \maxi where no groups are selected by the model. 

A key feature of our P-GLAW method is the need to accommodate the fact that pathways 
overlap, that is Qi f) Qi' 7^ for some I ^ l\ since SNPs may map to multiple pathways. To 



enable the independent selection of pathways, we instead require that groups are disjoint (Jacob 
[et al.^ |2QQ9| ). This is achieved through an expansion of the design matrix, X, formed from the 
column- wise concatenation of the L sub-matrices of size {N x Si), so that X = [Xi, X2, . . . , X/,]. 
This expanded X has dimensions {N x P*), with P"^ = ^iSi. A corresponding expansion of the 
parameter vector, b = [bi \ b2\ . . . , b^.^ is also required. 

Another issue that we address is the problem of pathway selection bias, by which we mean the 
tendency of the group lasso to favour the selection of specific pathways, under the null, where no 
SNPs influence the phenotype. Such biases can arise for example from variations in the number 
of SNPs or genes in pathways, and varying patterns of dependence (LD) between SNPs within 
pathways. Under the null, with the regularisation parameter A tuned so that a single pathway 
is selected, pathway selection probabilities should follow a uniform distribution, namely with 
probability 11/ = 1/L, for / = 1, . . . , L. However, where biasing factors are present, the empirical 
probability distribution, 11* will not be uniform. Our iterative weight tuning procedure works by 
applying successive adjustments to the pathway weight vector, uj = (cji, . . . ,co'l), so as to reduce 
the difference, di = n^*(cj) — H^, between the unbiased and empirical (biased) distributions for each 
pathway. We begin with an initial weight vector, cjf^^ = ^/Si, which corrects for the biasing effect 



of group size in the group lasso model (Silver and Mont anal 2012). At iteration r, we compute 



the empirical pathway selection probability distribution n*(u;^'^^) over multiple model fits with 
permuted phenotypes, and compute di for each pathway. We then apply the following weight 
adjustment 



w 



[1 - sign{di){r] - l)L^d^] < 7/ < 1, / = 1, . . . , L 



where the paramater 77 controls the maximum amount by which each wi can be reduced in a single 
iteration, in the case that pathway Qi is selected with zero frequency. The square in the weight 
adjustment factor ensures that large values of \di\ result in relatively large adjustments to wi. 
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Iterations continue until convergence, where Xlz^i l^d < ^• 

Even when relatively few SNPs or genes are associated with the phenotype, we can expect 
multiple pathways to harbour genetic effects since many SNPs and genes overlap multiple path- 
ways. Where more than one pathway is selected by the model, we therefore expect that pathway 
selection probabilities will not be uniform, since the presence of overlapping SNPs means that 
pathways are not independent. Instead, selection probabilities will reflect the pattern of overlaps 
corresponding to the distribution of causal SNPs (or spurious associations under the null). This 
non-uniform distribution of selection probabilities is to be expected and is in fact desirable, since 
a signal corresponding to causal SNPs or genes should be captured in each and every pathway 
that contains them. We have shown in extensive simulation studies, that where more than one 
pathway is selected, the weight tuning process described above leads to substantial gains in both 



sensitivity and specificity when identifying causal pathways (Silver and Montana, 2012). 

Estimates for b and a respectively represent the first (rank 1) latent factors that are expected to 
capture the strongest signal of association between gene pathways and the phenotype. In principle, 
it is possible to capture further latent factors of diminishing importance, by iteratively repeating 



the procedure described above, after regressing out the effects of previous factors (Vounou et al 



2010). With PsRRR, the estimation of further ranks is complicated by the need to recalibrate the 



group weights at each step, and by the typically large number of SNPs in selected pathways. For 
this reason we consider only the first latent factor in this study. 

2.6 Pathway, gene and SNP ranking 
Pathway ranking 

With most variable selection methods, a choice for the regularisation parameter. A, must be made, 
since this determines the number of variables selected by the model. Common strategies include the 
use of cross validation to choose a A value that minimises the prediction error between training and 



test datasets (Hastie et al. 2008). One drawback of this approach is that it focuses on optimising 



the size of the set, C, of selected pathways (more generally, selected variables) that minimises 
the cross validated prediction error. Since the variables in C will vary across each fold of the 
cross validation, this procedure is not in general a good means of establishing the importance of a 



unique set of variables (Vounou et al. 2011). Alternative approaches, based on data resampling or 



bootstrapping have been demonstrated to improve model consistency, in the sense that the 'true' 



variables are selected with a high probability (Bach, 2008 Meinshausen and Biihlmann, 2010). We 



adopt a resampling approach, in which we calculate pathway selection frequencies by repeatedly 
fitting the model over B subsamples of the data, at a fixed value for A. Selection frequencies 
are expected to be relatively insensitive to the choice of A, provided that it is small enough to 
ensure that the set C of causal pathways is selected with a high probability at each subsample 



(Meinshausen and Biihlmann 2010). 



We denote the set of selected pathways at subsample b by 

C(^) ={/:b[^^ ^0} 6=1,..., 5 



where bj^^ is the estimated SNP coefficient vector for pathway / at subsample b. The selection 
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probability for pathway / measured across all B subsamples is then 

-r = ;^E4''^ 1 = 1,. ..,L 

6=1 

where the indicator variable, ij^^^ = 1 if / G C^^\ and otherwise. Pathways are ranked in order 
of their selection probabilities, tt^^^^ nf^^^. 



SNP and gene ranking 

The PsRRR model is designed to identify important pathways which may contain multiple genetic 
markers with varying effect sizes. However, it is still interesting to establish which SNPs and genes 
are most predictive of the response amongst those mapped to the set C*^^^ of selected pathways at 
subsample b. Note that these are not necessarily the SNPs and genes that are driving the selection 
of any particular pathway in the PsRRR model. 

To rank SNPs and genes, we perform a second level of variable selection using sRRR with a 



lasso penalty (Vounou et al, 2011). We first form the reduced {N x Z^^^) matrix X^(5), with 
columns {Xj : j G U/gc(^) corresponding to all SNPs in pathways selected at subsample h. 
Sparse estimates for the corresponding SNP coefficient vector, /3, and rank-1 phenotype vector a 
then satisfy the equivalent of ([8| with a lasso penalty, namely 

= argmin{i(-2aVX^^(,)/3 + aa'/3'X^(,)X^>)/3) + A||^ 

We denote the set of SNPs selected at sample h by S^^\ and further denote the set of selected 
genes to which the SNPs in S^^^ are mapped by (j)^^^ C where $ = {1, . . . , G} is the set of gene 
indices corresponding to all G mapped genes. Using the same strategy as for pathway ranking, 
we obtain an expression for the selection probability of SNP j across B subsamples as 



1 ^ 

.SNP ^ r^^) 
j B^ ^ 

6=1 



where the indicator variable, /j^^ = 1 if j G S^^\ and otherwise. A similar expression for the 
selection probability for gene g is 

1 ^ 
R ^9 



B 

6=1 



where the indicator variable, I^^ = 1 if ^ G 0^^^, and otherwise. SNPs and genes are then ranked 
in order of their respective selection frequencies. 

2.7 Computational Issues 

All computer code is written in the open source Python programming language, using Numpy and 
SciPy modules which are optimised for efficient operation with large matrices. Execution of the 
PsRRR estimation algorithm nonetheless presents a considerable computational burden, both in 
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terms of processor time and memory use. We therefore implement a number of strategies designed 



to increase computational efficiency (see Silver and Montana (2012) for details). We use a Taylor 



approximation of the group penalty that avoids the need for computationally intensive numerical 



search methods (Breheny and Huang 2009 Friedman et al. 2010). In addition, we use an 'active 



set' strategy (Tibshirani et al., 2010; Roth and Fischer, 2008), that identifies a subset of pathways 



that are more likely to be selected by the model at a given A. Model estimation then proceeds 
with this reduced set, followed by a final check to ensure that no other pathways should have 
been included in the active set in the first place. Depending on the choice of A, this can lead 
to substantial gains in computational efficiency and a large reduction in memory requirements, 
resulting from the very much reduced size of X in l](a, Y, X, A). 

The need to fit a large number of PsRRR models over multiple subsamples of the data for 
pathway ranking presents another major computational bottleneck. However, the fact that each 
subsample is generated entirely independently presents an opportunity for performing multiple 
model fits in parallel. We implement such a strategy using a computer cluster, in which a single 
client node distributes subsamples across 8 server nodes. Parallel computations and client-server 



communication are implemented in Parallel Python ( http : //www . parallelpython . com/ ) . The 



reduction in computation time due to parallelisation is considerable. For example, in the AD 
study described in this paper, total execution time (excluding weight tuning) with B = 1000 
subsamples was 6^ hours, whereas total execution time if each job were run separately would be 
approximately 10 1 days. 

3 Results 

3.1 AD associated phenotypes 

An imaging signature characteristic of AD was created using the procedure described in section 



2.2| As described previously, we begin by computing a linear least-squares fit of the longitudinal 
structural change across 3 time points at each voxel. An illustration of average slope coefficients, 
and their variation between subjects, is shown in Fig.jsj Increased expansion of ventricular volumes 
is clear in all subjects, but this increase is most marked in AD patients, where ventricular volumes 
expand by an average 1.2% per year (white regions in left hand part of Fig. |5|. AD patients also 
show the most variation in structural change over time. 

A statistical image showing the corresponding ANOVA p-values, a measure of the extent to 
which each voxel is able to discriminate between ADs and CNs, is shown in the top row of Fig. [6j 
From the Q* = 2, 153, 231 voxels in this image, we extract a final set of Q = 148, 023 voxels whose 
p-values exceed a Bonferroni-corrected threshold of O.OS/Q*. This final set of voxels that are most 
discriminative between ADs and CNs, are highlighted in yellow in the bottom row of Fig.[6j These 
Q voxels constitute the phenotype for each subject used in the study. We visualise the Euclidean 
distances between subjects using the selected voxels in a 3D multi-dimensional scaling plot in 
Fig. in 

To verify the power of the set of selected voxels to discriminate between ADs and CNs, we 
used a linear classifier, with Gaussian, class-conditional densities and common diagonal covariance 
matrices. The performance of the linear classifier was assessed using 10-fold cross-validation, giving 
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Figure 5: Sample mean (left) and standard deviation (right) of slope coefficients for the 3 subject groups. 
Slope coefficients represent a linear approximation of change in brain volume over time. Scales represent 
10 X percentage change in voxel volume per year, so that for example a slope coefficient of 12 (white areas 
in left hand plot) is equivalent to an average yearly increase in voxel volume of 1.2%. 




Figure 6: Imaging signature characteristic of AD. Top: Statistical image showing p-values (— log^Q scale) 
obtained from an ANOVA on the linear structural change over 3 time points, corrected for age and sex, to 
discriminate between AD and CN subjects. Bottom: The final set of Q = 148, 023 selected voxels with p- 
values exceeding a Bonferroni-corrected threshold as = 0.05/2153231, (— log^Q as = 7.6) are highlighted 
in yellow. 



17 



-1000 -3000 



Figure 7: 3D multi-dimensional scaling plot illustrating the spread of imaging signatures across ADs and 
CNs. Imaging signatures correspond to selected voxels only. 



figures for accuracy, sensitivity and specificity of 85.0%, 78.8% and 89.0%, respectively. While 
optimisation of classification performance was not our primary concern, these figures compare 



well with a range of high dimensional classification methods cited in the literature (Liu et al. 



2012 Cuingnet et al , 2011) 



3.2 Pathway, SNP and gene rankings 

We use the PsRRR algorithm described in section [23] to identify KEGG pathways associated with 
the AD-discriminative longitudinal phenotypes described in the preceding section. Pathways are 
ranked in order of importance using the pointwise stability selection method described in section 
|2.6[ with B = 1000 subsamples. We use A = O.SXmaxi which results in the selection of an average of 
7 pathways at each subsample (min 1, max 15, SD = 2.3). Pathway ranking results are presented 
in Tabled 
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Table 3: Top 30 pathways, ranked by pathway selection frequency. 



Rank 


KEGG pathway name 




Size 


Lasso selected genes in pathway-*^ 


Known AD genes^ 








(# SNPs) 




in pathway 


1. 


Chemokine signaling pathway 


0.261 


2769 


PRKCB PIK3R3 PIK3CG ADCY8 ADCY2 ITK GNAIl XCLl GNG2 GRK5 


GGR2 IL8 


2. 


Jak stat signaling pathway 


0.234 


1311 


PIK3R3 PIK3GG IL2RA 




3. 


Tight junction 


0.227 


3332 


PRKGB PRKGA YESl AG TNI GNAIl GTNNA2 




4. 


Insulin signaling pathway 


0.218 


1517 


PIK3R3 PIK3GG HK2 G6PG AG AG A 




5. 


Leukocyte transendothelial migration 


0.213 


2289 


PRKGB PIK3R3 PRKGA PIK3GG AGTNl ITK GNAIl GTNNA2 




6. 


Leishmania infection 


0.204 


620 


GRl PRKGB 


GRl 


7. 


Calcium signaling pathway 


0.202 


5111 


PRKGB PRKGA ADGY8 ADGY2 MYLK ATP2B2 RYR2 SLG8A1 




8. 


Complement and coagulation cascades 


0.184 


783 


GRl 


GRl 


9. 


Vibrio cholerae infection 


0.174 


831 


PRKGB PRKGA KGNQl 




10. 


Cytokine cytokine receptor interaction 


0.163 


2267 


XGLl IL2RA 


FAS GGR2 IL8 


11. 


Citrate cycle tea cycle 


0.157 


210 






12. 


Focal adhesion 


0.154 


4009 


PRKGB PIK3R3 PRKGA PIK3GG AGTNl MYLK GOL5A3 




13. 


Alzheimers disease 


0.138 


2500 


APOE 


APOE FAS GRIN2B 


14. 


Arrhythmogenic right ventricular cardiomyopathy arvc 


0.136 


3229 


AGTNl RYR2 GTNNA2 SLG8A1 




15. 


Phosphatidylinositol signaling system 


0.133 


2067 


PRKGB PIK3R3 PRKGA PIK3GG DGKA DGKB DGKI 




16. 


Small cell lung cancer 


0.116 


1610 


PIK3R3 PIK3GG 




17. 


Pyruvate metabolism 


0.115 


456 


AGAGA 




18. 


Glycerophospholipid metabolism 


0.113 


1047 


DGKA DGKB DGKI 




19. 


Glycolysis gluconeogenesis 


0.111 


611 


HK2 G6PG PFKP 




20. 


Propanoate metabolism 


0.108 


471 


AGAGA 




21. 


Fc gamma r mediated phagocytosis 


0.102 


1976 


PRKGB PIK3R3 PRKGA PIK3GG 




22. 


Huntingtons disease 


0.100 


1980 




GRIN2B 


23. 


Abe transporters 


0.099 


1701 






24. 


Hematopoietic cell lineage 


0.097 


905 


GRl IL2RA 


GRl 


25. 


Regulation of actin cytoskeleton 


0.094 


3371 


PIK3R3 PIK3GG AGTNl MYLK 




26. 


Aldosterone regulated sodium reabsorption 


0.091 


744 


PRKGB PIK3R3 PRKGA PIK3GG 




27. 


Toll like receptor signaling pathway 


0.091 


712 


PIK3R3 PIK3GG 


IL8 


28. 


Proximal tubule bicarbonate reclamation 


0.081 


214 






29. 


Melanogenesis 


0.078 


1638 


PRKGB PRKGA ADGY8 ADGY2 GNAIl 




30. 


Drug metabolism cytochrome p450 


0.078 


669 







■•^Top 30 ranked genes in this pathway, using lasso selection (see Table [41. ^Previously identified AD genes in the pathway (see Table [2^ 



SNPs and genes are ranked using sRRR with a lasso penalty on the SNP coefficient vector, as 
described in section [2^ Lasso selection is performed on pathways selected at each subsample in 
the pathways analysis described above, so that once again B = 1000. The number of SNPs, Z^^\ 
included in the lasso model at subsample b varies according to the number and size (in terms of 
the number of mapped SNPs) of selected pathways. Z^^^ ranges from a minimum of 132, to a 
maximum of 21,158 (mean = 9,835; SD = 3,729). As with pathway ranking, we use A = O.SXmax^ 
which results in the selection of an average of 13.1 SNPs at each subsample (min 1, max 75, SD 
= 15.5). 

Table 4: Top 30 SNPs and genes, respectively ranked by SNP and gene selection frequency, using lasso 
sRRR. Note the APOE gene is selected at a lower frequency than the AP0e4 SNP, since in a significant 
minority of subsamples the allele is selected in a pathway where it is mapped to the TOMM40 gene only 



Rank 


SNP 


SNP RANKING 
^SNP Mapped gene(s) 


Gene 


GENE RANKING 

^gene ^ mapped SNPs 


1 


rsllll8131 


0.302 


CRl 


CRl 


0.302 


21 


2 


rs650877 


0.302 


cm 


PRKCB 


0.284 


73 


3 


rsl2734030 


0.302 


CRl 


PIK3R3 


0.219 


9 


4 


rsllll7959 


0.302 


CRl 


PRKCA 


0.188 


99 


5 


rs4788426 


0.284 


PRKCB 


PIK3CC 


0.18 


9 


6 


rsll074601 


0.257 


PRKCB 


YESl 


0.171 


11 


7 


rs677066 


0.242 


CRl 


ADCY8 


0.159 


69 


8 


rs6691117 


0.242 


CRl 


ADCY2 


0.159 


106 


9 


rsl052610 


0.219 


PIK3R3 


HK2 


0.148 


28 


10 


APOE4 


0.188 


APOE, TOMM40 


DCKA 


0.139 


3 


11 


rs4622543 


0.188 


PRKCA 


ACTNl 


0.138 


41 


12 


rs9896483 


0.18 


PRKCA 


APOE 


0.138 


4 


13 


rs4730205 


0.18 


PIK3CG 


ITK 


0.125 


27 


14 


rsl2185470 


0.171 


YESl 


CNAIl 


0.122 


22 


15 


rsl2185469 


0.171 


YESl 


XCLl 


0.122 


7 


16 


rsl7516202 


0.171 


YESl 


IL2RA 


0.119 


44 


17 


rsl3189711 


0.159 


ADCY2 


MYLK 


0.117 


24 


18 


rs263264 


0.159 


ADCY8 


ATP2B2 


0.111 


135 


19 


rs680545 


0.148 


HK2 


COL5A3 


0.106 


14 


20 


rsl0876862 


0.139 


DGKA 


KCNQl 


0.103 


118 


21 


rs772700 


0.139 


DGKA 


RYR2 


0.098 


190 


22 


rs4902672 


0.138 


ACTNl 


CTNNA2 


0.09 


421 


23 


rsl81455 


0.13 


ACTNl 


GNG2 


0.088 


31 


24 


rsl3184646 


0.125 


ITK 


C6PC 


0.086 


6 


25 


rs6973616 


0.122 


GNAIl 


PFKP 


0.084 


51 


26 


rs2419114 


0.122 


XCLl 


CRK5 


0.084 


56 


27 


rs942201 


0.119 


IL2RA 


DGKB 


0.083 


200 


28 


rsll256448 


0.117 


IL2RA 


SLC8A1 


0.083 


192 


29 


rsl 107345 


0.117 


IL2RA 


ACACA 


0.08 


23 


30 


rsl254403 


0.117 


MYLK 


DGKI 


0.078 


49 



We first consider the pathway ranking results in Table |3] Under the null, where there is 
no association between phenotypes and genotypes, and with a single pathway selected by the 
model at each subsample, the expected pathway selection frequency distribution is uniform, with. 



^path _ iji^^ ^ 0.005. As explained in section 2.5, where more than one pathway is selected at 



each subsample the selection frequency distribution will depend on the distribution of causal SNPs 
and genes, and will not be uniform. For this reason, while we report pathway selection frequencies, 
^path^ the main focus is on pathway rankings. To aid interpretation of pathway rankings, for each 
pathway, we list those genes in the pathway that are ranked in the top 30 genes, selected by lasso 
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selection (in Table [4|. 

In the final column of Table |3] we list genes in the top ranked pathways that have previously 
been linked to AD. Both the number of such genes affecting phenotypes in this study, and the 
extent to which these genes may drive pathway selection are unknown. It is nevertheless interest- 
ing to consider whether these genes are significantly enriched amongst high-ranking pathways. To 
do this we calculate an average ranking for each 'AD gene' by taking the average rank achieved 
by all pathways containing the gene in question. We then derive an AD gene enrichment score 
by summing average AD gene ranks across all AD genes. A lower score thus indicates pathways 
containing AD genes tend to be ranked high. We compare this empirically derived score with the 
distribution of scores obtained by permuting pathway rankings 100,000 times. The null distribu- 
tion of this enrichment score (obtained by permutation), and the empirically observed value are 
compared in Fig. [Sj Finally, we compute a p-value for the null hypothesis that the empirically 
observed enrichment score has arisen by chance, as the proportion of enrichment scores obtained 
through permutation that are lower than the observed value. This gives a value p = 0.0094. 



empirical enrichment score 




700 



AD gene enrichment score 

Figure 8: Measure of extent to which genes previously linked to AD are enriched in highly-ranked pathways. 
The histogram shows the distribution of AD gene enrichment scores obtained when permuting pathway 
rankings 100,000 times. The vertical black line indicates the observed AD gene enrichment score using 
the true pathway rankings obtained in the study. From this we derive a p-value indicating the probability 
that the empirical AD gene enrichment score could arise by chance as p = 0.0094. AD-linked genes are 
those identified in |Braskie et al.^ ( ^2011| ) . 



4 Discussion 



We describe a method for the identification of gene pathways associated with a multivariate 
quantitative trait (MQT). Here, we extend previous work modelling a univariate response, where 
we showed that a multivariate, group-sparse modelling approach can demonstrate increased power 
to detect causal pathways, when compared to conventional approaches that begin by modelling 



individual SNP-phenotype associations (Silver and Montana 2012). We apply our method in an 



AD gene pathways study using imaging endophenotypes, but our method is not restricted to the 
case of biological pathways or imaging phenotypes, and can be applied to any data in which we 
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seek to identify sparse groups of predictors affecting a multivariate response. 

In any method modelling effects on an MQT, the use of a multivariate disease signature that 
is characteristic of the disease under investigation is important. This is especially so in the case of 
high-dimensional imaging phenotypes, where a poorly characterised imaging signature with low 



signal to noise ratio may show no advantage over a simple ROI average-based approach (Vounou 



et a/., 2011). In this study we extract an AD imaging phenotype that is highly discriminative of 



subjects with the disease, compared to controls. 

Of the pathways identified as being associated with these AD endophenotypes (see Table [3|, 
functions associated with many of the top 10 ranked pathways have been linked to aspects of 
AD biology described in the literature, including chemokine signalling, Jak stat signalling, tight 



junction protein action, calcium and insulin 


signalling (Xia and Hyman 1999 Kim et al., 


Huber et al , 2001 


de la Monte and Wands , 


2005 Steen et al , 


2005 Ravetti et al , 


2010). 



2003 



In order to better elucidate which genes may be driving pathway selection, we performed a 
follow up analysis designed to identify SNPs and genes in selected pathways that are separately 
associated with the phenotype (see Table [4|. Since these gene (and associated SNP) rankings 
are derived from lasso selection of all SNPs within selected pathways, irrespective of their 'group' 
structure within pathways, they are expected to capture larger, independent signals of association, 
and not necessarily all the salient signals within a particular pathway that may be driving pathway 
selection. In particular, the group lasso is designed to detect distributed signals that may not 
be highlighted using lasso selection. From this analysis, it is clear that the lipid kinase genes 
PIK3R3/PIK3CG, and the calcium-activated, phospholipid-dependent genes PRKCA/PRKCB 
are important in driving selection of many pathways in the top 30 ranks. All these genes have 
previously been linked in gene expression studies with /3-amyloid plaque formation in the AD brain 
(Liang et al 2008). Aside from the validated AD risk gene CRl (see below), other top-ranked 
genes occurring more than once in the top 10 ranking pathways, include ADCY2 and GNAIl^ 



both of which have also been associated with AD in gene expression studies (Ravetti et al 2010 



Taguchi et al 2005). 



Of particular note amongst top-ranking pathways is the citrate (tea) cycle, since this contains 
no genes identified in the separate gene-ranking analysis. This suggests that selection of this 
pathway might be driven by signals with distributed small effects, rather than signals with larger 
marginal effects. Interestingly, this pathway is ranked first if the analysis is run selecting only a 
single pathway at each subsample (data not shown). A number of studies have suggested links 
between aberrations in the tea cycle and AD ( [Atamna and Fre"y 2007). 

Turning to the SNP and gene rankings in Table |4j the top ranking gene is the complement 
component receptor gene CRl^ and many of the top ranking SNPs also map to this gene. CRl 
has been identified as one of the major AD risk genes (Lambert et al 2009 Sleegers et al, 2010), 
and has been associated with changes in hippocampal volume in AD (Bifii et al 2010). As well as 
the high-ranking PIK3R3/PIK3CG/PRKCA/PRKCB genes discussed above, the major AD risk 
and phenotype-related gene APOE^ and risk allele APOEeA are respectively ranked 10 and 12. In 
our study the APOE gene maps to a single pathway, the KEGG Alzheimer's disease pathway, and 
this pathway is selected in ^ 14% of subsamples. Notably, in all subsamples in which the KEGG 
Alzheimer's disease pathway is selected, the APOEeA allele is the sole selected SNP, confirming 
the known large marginal effect of this allele on AD phenotypes. The fact that the Alzheimer's 
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disease pathway is not ranked higher may reflect the fact that selection of this pathway is driven by 
the presence of this single, strong APOEeA signal, and as explained above, the model is designed 
to identify distributed signals across a pathway. The slightly higher ranking of the APOEeA SNP, 
relative to the APOE gene, reflects the fact that this SNP also maps to the TOMM40 gene, which 
occurs in a number of other pathways selected by the model. The TOMM40 and APOE genes 
are in LD, and there is evidence of their interaction in AD ( [Rajagopalan et 'aL\ 2012). 

Our model rests on a number of assumptions, and as a consequence will fail to detect a 
number of different association signals. For example, while our model implicitly accommodates 
the fact that SNPs and genes interact within functional pathways, we do not explicitly model 
interaction effects. Also, we make the simplifying assumption that voxel-wise measures of atrophy 
are uncorrelated. In reality, the phenotype will exhibit a complex correlation structure which 
will affect the association signal. |Vounou et al. (2010) have demonstrated that even under this 
simplifying assumption, significant gains in power can be achieved by modelling a multivariate 
phenotype, compared to a mass univariate modelling approach. Finally, our model is founded on 
the assumption that causal SNPs tend to accumulate within functional pathways, and as such 
is not designed to identify significant marginal effects, as evidenced by its failure to rank the 
high-risk APOE gene highly. For this last reason, any pathways analysis should be seen as being 
complementary to conventional GWAS approaches. 

This study also demonstrates some of the limitations of pathways studies in general. Many 
genes previously implicated in AD do not map to known pathways in our study, so that these 
genes and their associated SNPs, many of which are well validated, are excluded. This relative 
sparsity of gene-pathway annotations reflects the fact that our understanding of how the majority 
of genes functionally interact is at an early stage. As a consequence, annotations from different 
pathways databases often vary (Soh et al, 2010), and in any case are undergoing rapid change. 
Pathways studies also suffer from a lack of benchmark datasets to compare different methods, 
and inevitably fail to capture the true complexity of dynamic interactions between pathways, by 
taking a snapshot of a biological system at a given point in time (Khatri et al. 2012). 
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